Method and apparatus for pointing to documents electronically using features extracted from a scanned icon representing a destination

ABSTRACT

An example page taken from each document in a document database is processed by a page processor to yield an iconic representation for the example page. To form the iconic representation, the example page is segmented into text regions, line art regions, photograph regions, etc., and each region is reduced in a manner appropriate for that image type. Text is replaced with a block font and reduced, while graphics are reduced in level and/or spatial resolution. The reduced regions of the example page are then reassembled into the icon. When multiple icons are printed on a guide page, a user can visually identify the icon for an example page of a target document and supply the icon, or a label for the icon, to a document retrieval system, which selects candidate matching documents from the document database. For simplified processing, characters can be blocked and words formed into solid line segments with lengths proportional to word lengths.

This is a Division of application Ser. No. 08/431,059, filed Apr. 28, 1995, which issued as U.S. Pat. No. 5,717,940 on Feb. 10, 1998.

BACKGROUND OF THE INVENTION

The present invention relates to the field of document storage and retrieval, in particular, the retrieval of a document from a document database using content from an example page taken from the document.

A general approach to the problem of retrieving a target document from a document database is to store a set of key words with each document, either physically with the document or, more probably, in a lookup table in which the keys are indexed and table entries point to documents in the database. Keys can be easily generated from documents if electronic versions of documents are available. If only paper versions of the documents are available, they can be scanned to form digital images of the pages of the documents and the digital images can be processed by a character recognizer to extract the text of the document and thus the keys. In a more labor-intensive system, the keys can be manually entered.

To retrieve a document, the keys are supplied to a search engine. Where a user is not likely to remember the keys for every document stored in the database, the user can retain an example page from each document as it is stored and supply that example page to a page analyzer for key extraction.

The disadvantage of this general approach is that the documents in the document database and the example pages either need to originate and remain in electronic form, or character recognition would need to be done on example pages to determine the keys. Thus, either the example page needs to be electronic or has to be of sufficient quality that errors do not occur in the scanning and character recognition process.

One example of a prior art system for document presentation is the RightPages document presentation system described in G. Story, “The Right Pages Image-Based Electronic Library for Alerting and Browsing”, COMPUTER, Sept. 1992. In that system, a user is presented with a series of journal covers and the user browses the journal covers to find a desired journal, then browses its table of contents and then selects an article from the journal. Once an example page of a journal article is selected, the system retrieves the target article from a document database. The disadvantage to the RightPages system is that the icons are presented on a computer monitor and therefore are lower resolution than print, and the links between the journal covers and the pages must already exist. Thus, the user must be at the computer monitor to browse example pages.

The document storage and retrieval system taught by U.S. Pat. No. 5,465,353 to Hull, et al., entitled “IMAGE MATCHING AND RETRIEVAL BY MULTI-ACCESS REDUNDANT HASHING” (commonly owned by the assignees of the present application, incorporated by reference herein, and hereinafter “Hull”) is a system for retrieving a target document from a document database by submitting a paper example page retained from the target document to a search engine. The search engine analyzes the example page and determines likely matches among the documents in the database. Where many documents are to be stored, however, storage and organization of the example pages raises some of the same problems that document database storage tries to alleviate, such as having to allocate storage space for paper pages and keeping them organized.

Thus, what is needed is a system for efficiently storing example pages for use in document retrieval and document management.

SUMMARY OF THE INVENTION

An improved document server is provided by virtue of the present invention. A document server is a computer system which maintains a database of documents, either in a structured form such as editable computer files, as digitized images of paper pages from the documents, or a combination of both. A target document is a document in the document database whose retrieval is desired. To retrieve the document, an input is provided to the document server indicating one or more characteristics of the target document, such as keys, a unique label, or an example page. Typically, a document is provided to the document server and only one page is retained. The retained page can then serve as the example page, to be provided when the entire document is desired. An example page could be the first page of the document, but it need not be the first page, nor even a complete page of the document, so long as the example page (or page portion) could be used to distinguish the target document from the other documents in the document database, or at least to identify a set of candidate matching documents which closely match the target document and can be presented to a user for selection of the target document from among the candidate matching documents.

In one embodiment of a document server according to the present invention, an example page for each document in a document database is processed by a page processor to generate an icon, i.e., an iconic representation, of an example page of the document. Typically, this is done at the time the document is first stored in the document database. The page processor analyzes an example page to segment regions of the example page according to image types, such as text, line art, photographs, other graphics, borders, colored areas, glyphs, bar codes, etc. Of course, not all image types need be found in all example pages and image types are not limited to those mentioned here. Once segmented, each region is characterized and reduced in a manner appropriate for the image type of the region. For example, text in text regions is replaced with a block font (defined below) and reduced, while graphics regions are reduced in resolution (by lowering pixel precision and/or the number of pixels per unit area). The reduced regions of the example page are then reassembled into an icon of the example page.
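The two kinds of graphics reduction mentioned above (lowering pixel precision and lowering the number of pixels per unit area) can be illustrated with a minimal sketch. It assumes a grayscale region held as a list of rows of 0-255 values; the function names, the block-averaging scheme, and the uniform quantization are illustrative assumptions and are not taken from the specification.

```python
# Hypothetical sketch: two ways to reduce a grayscale graphics region.
# The image representation (list of rows, values 0..255) is an assumption.

def reduce_spatial(image, factor):
    """Lower the pixel count by averaging each factor x factor block."""
    out_h = len(image) // factor
    out_w = len(image[0]) // factor
    reduced = []
    for by in range(out_h):
        row = []
        for bx in range(out_w):
            block = [
                image[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) // len(block))
        reduced.append(row)
    return reduced


def reduce_levels(image, levels):
    """Lower pixel precision by snapping values to `levels` gray levels.

    `levels` is assumed to divide 256 evenly (2, 4, 8, ...).
    """
    step = 256 // levels
    return [[(pixel // step) * step for pixel in row] for row in image]


region = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
small = reduce_spatial(region, 2)             # one output pixel per 2x2 block
coarse = reduce_levels([[0, 63, 64, 255]], 4)  # four gray levels
```

Either reduction can be applied alone or both in sequence, which matches the "and/or" wording of the paragraph above.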

In a specific application of the present invention, many icons are printed on a single page, referred to herein as a “guide” page. This guide page, or multiple guide pages depending on the number of icons, is provided to a user. To retrieve a document, the user visually scans the guide page to find an icon which is visually associated with the target document and then supplies an indication of the selected icon to the document server. The document server analyzes the contents of the icon to detect distinguishing features of the example page represented by the icon and provides those features to a search engine. The search engine then identifies candidate matching documents in the document database. If more than one candidate matching document is returned, the document server provides information about each candidate, such as a thumbnail image of a portion of the candidate document, so that the user can manually select the target document from the candidate matching documents.

Alternatively, each icon could be assigned an identifying label, such as a unique alphanumeric code or machine-readable bar code, which the user provides to the document server for a lookup of the target document. Although the document server does not need to use the content of the icon image for document retrieval, the content of the icon is nonetheless useful to the user, to provide compact visual cues to the target document. With a guide page, the user can scan many icons quickly. Because of the page reduction process, the distinguishing features of the example documents are preserved over the iconization process, and the icons can be made smaller while still allowing distinguishing features to be distinguishable to the user. Instead of each icon having a unique identifier, the icon might be specified by a unique identifier for the guide page on which it is found and the icon's location (e.g., row/column) on the guide page.

Variations of the above embodiments are envisioned. For example, the document server might be integrated with a digital copier to allow the digital copier to output an entire document in response to a user submitting a guide page with an icon circled. The digital copier would scan the submitted guide page and either extract information from the content of the icon, or extract a guide page identifier and determine the icon's location on the guide page. Where decentralized document servers are used and different guide pages are used by different users for the same documents, the former option of identifying the icon only from the icon contents is the preferred approach. The interface for scanning icons and printing documents could be an ordinary facsimile machine, thus allowing global, remote document retrieval.

In some embodiments, multiple icons might be provided for a document to increase the possibility that the user would find a recognizable portion of the document. This is preferred where the number of guide pages or icons is not critically constrained. Also, if desired, the document server can give the user a choice to retrieve less than all of the target document, such as when the target document is to be printed but the user only requires a few pages of a long document.

In a specific embodiment of the page processor, characters are blocked, interword spaces are identified, and the characters of words are replaced with a line whose length is proportional to the word's length. This is one method of reducing the error rate when extracting word lengths from icons. One advantage to reducing the error rate is that it allows correspondingly smaller icons to be used.

Icons can also be used as a paper interface to eliminate the need for other types of data entry, such as the entry of a selection from a list to retrieve a data element such as a telephone number or electronic mail address, instead of using the icon to identify a document.

A further understanding of the nature and advantages of the inventions herein may be realized by reference to the remaining portions of the specification and the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a document storage and retrieval system including a page processor;

FIG. 2 is a more detailed block diagram of the page processor shown in FIG. 1;

FIG. 3 is an illustration of an example page;

FIG. 4 illustrates an icon representing the example page shown in FIG. 3;

FIG. 5 is an illustration of a guide page which includes the icon shown in FIG. 4;

FIG. 6 is a flow chart of a process for storing documents in a document database including the creation of paper guide pages; and

FIG. 7 is a flow chart of a process for retrieving a document from a document database using a guide page.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 shows a document server 10 according to one embodiment of the present invention. Document server 10 accepts input documents, such as input document 12, for storage and responds to user requests for documents. Three user requests are represented in FIG. 1 by an icon 14, a label 16 and a guide page 18, respectively, although other forms of requests are possible including combinations of the requests shown. The user request is a request for a specific document stored by document server 10, such as target document 20 shown in FIG. 1. Document server 10 supplies target document 20 based only on the input request or, if necessary, on further prompting of the user to select among a set of closely matching documents (candidate matching documents). If input documents 12 are paper documents, they are scanned into digital images by a scanner 30 before being provided to a document storage unit 32. Otherwise, if input documents 12 are supplied in electronic form, they are provided directly to document storage unit 32, without the need for scanning. Document storage unit 32 processes an input document 12 and generates an icon 34 for input document 12, places the digital representation of input document 12 into a document database 36 while generating document indexing data to be stored into a document index table 38.

Document storage unit 32 comprises a page processor 40, which generates icons such as icon 42, a key generator 44, and an optional icon serializer 46. Page processor 40 processes an example page taken from document 12 being input to document server 10, to form an icon. This process is described in more detail below. Key generator 44 extracts information from the input document 12 and generates the keys used to locate the document 12 after storage. In some cases, key generator 44 will scan the text of a document 12 if it is a structured document (or will first do character recognition), but will also generate keys based on descriptors as taught by Hull. These generated keys are stored in document index table 38 along with a pointer to the location of document 12 in document database 36.

Where icon identifiers are used, an icon identifier is generated by icon serializer 46 and attached to the icon 42, which is output in a form usable by a document requestor as icon 34. Icon serializer 46 generally increments the number or code used to identify a particular icon, and the number or code is also sent to document index table 38 to be used as a key for the document 12. Icon serializer 46 can be initialized as needed to account for changes in sequences. One use for sequence changing is that each user might maintain a guide page of that user's documents and might desire sequential numbers for that user's icons. In this case, the guide page might be provided as the first page of a scan job, and either page processor 40 or key generator 44 would recognize it as a guide page, extract the serialization of the existing icons and pass that information on to icon serializer 46 so that the next icon could be serialized in order. Of course, the entire guide page can be provided to icon serializer 46 so that icon 42 could be added to the guide page and a new guide page containing icon 42 and all previous icons from that guide page could be printed for the user.
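The behavior of the icon serializer described above (increment per icon, re-initialize from the serials found on an existing guide page) can be sketched as follows. The class and method names are hypothetical, and plain integers stand in for whatever number or code a real system would use.

```python
# Hypothetical sketch of an icon serializer; integer serials are an assumption.

class IconSerializer:
    def __init__(self, start=1):
        # Next serial number to hand out.
        self._next = start

    def resume_after(self, existing_serials):
        """Re-initialize from serials already present on a user's guide page."""
        serials = list(existing_serials)
        if serials:
            self._next = max(serials) + 1

    def next_serial(self):
        """Return the serial for a new icon and advance the sequence."""
        serial = self._next
        self._next += 1
        return serial


serializer = IconSerializer()
first = serializer.next_serial()
# A guide page scanned as the first page of a job already carries 3, 5 and 7.
serializer.resume_after([3, 7, 5])
continued = serializer.next_serial()
```

In this sketch the serial returned by `next_serial` would be both attached to the icon and stored as a key in the index table.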

Documents are retrieved from document database 36 by a document retrieval unit 50 of document server 10 which accepts a user request and responds with target document 20. Although user requests are shown in FIG. 1 being provided directly to document retrieval unit 50, the user requests might also be provided remotely, such as over a network or via a facsimile machine. Document retrieval unit 50 is shown with an analysis engine 52, a search engine 54 and a presentation engine 56. Analysis engine 52 is coupled to accept user requests and is coupled to search engine 54 to provide features of the request to the search engine, as explained in more detail below. Search engine 54, in turn, is coupled to document index table 38 to send keys and receive pointers to matching documents. Search engine 54 is also coupled to presentation engine 56 to send lists of candidate documents (some lists might contain only one document where the keys are sufficient to uniquely identify the target document). Presentation engine 56 is also coupled to document database 36 in order to retrieve documents therefrom, and is coupled to various output devices (not shown), such as a digital copier, a computer display, a printer, a facsimile machine or an electronic mail server.

In operation, the user request is supplied to analysis engine 52. If the user request is in the form of an icon, the analysis engine extracts information from the content of the icon. If the user request is in the form of an icon identifier (icon ID or guide page ID and icon position), then that identifier is the feature used. Analysis engine 52 provides the extracted feature(s) to search engine 54.

Search engine 54 uses the extracted features to generate keys for searching for the target document. Hull teaches the storing of hashed redundant descriptors of a document, which in this case would serve the role of keys. Where an icon identifier is used instead of the icon contents, that identifier is used as the key. The key is used to index into document index table 38 to retrieve a list of one or more matches corresponding to the candidate matching documents. Where an icon identifier is used, there will normally be only one candidate matching document; however, in a system where one icon might select for multiple versions of a document, more than one candidate matching document might exist.
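The two lookup paths described above can be illustrated with a simplified stand-in. This sketch is not the descriptor scheme of Hull: as an assumption for illustration, each descriptor is a run of three consecutive word lengths, descriptors are used directly as dictionary keys, and candidates are ranked by how many descriptors they share with the query. All class, method and document names are hypothetical.

```python
# Hypothetical sketch of a document index table with two lookup paths.
# The word-length n-gram descriptors are a stand-in, not Hull's method.
from collections import Counter, defaultdict


def descriptors(word_lengths, n=3):
    """Overlapping runs of n consecutive word lengths (redundant keys)."""
    return [tuple(word_lengths[i:i + n]) for i in range(len(word_lengths) - n + 1)]


class DocumentIndex:
    def __init__(self):
        self._by_descriptor = defaultdict(set)  # descriptor -> document pointers
        self._by_icon_id = {}                   # icon identifier -> pointers

    def add(self, doc_pointer, word_lengths, icon_id=None):
        for key in descriptors(word_lengths):
            self._by_descriptor[key].add(doc_pointer)
        if icon_id is not None:
            self._by_icon_id.setdefault(icon_id, []).append(doc_pointer)

    def lookup_by_icon_id(self, icon_id):
        """The identifier itself is the key; usually one match."""
        return list(self._by_icon_id.get(icon_id, []))

    def lookup_by_content(self, word_lengths, top=3):
        """Rank candidates by the number of matching descriptors."""
        votes = Counter()
        for key in descriptors(word_lengths):
            for doc in self._by_descriptor.get(key, ()):
                votes[doc] += 1
        return [doc for doc, _ in votes.most_common(top)]


index = DocumentIndex()
index.add("docA", [3, 5, 2, 7, 4], icon_id=101)
index.add("docB", [1, 1, 2, 7, 4, 9], icon_id=102)
candidates = index.lookup_by_content([5, 2, 7, 4])
```

Because the descriptors overlap, a query can still reach the right document when some word lengths are misread, which is the purpose of storing the keys redundantly.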

Search engine 54 provides the list of matches to presentation engine 56. Presentation engine 56 then retrieves the candidate matching documents from document database 36 and presents them according to presentation instructions provided in the user requests. For example, the user might request to view the document on the computer monitor or request that it be printed. Where a digital copier is used, requesting that the document be printed might be transparent, i.e., the user requests the document and the digital copier assumes that it is to be printed. Presentation engine 56 might also include an interactive interface to allow the user to browse through the candidate matching documents or thumbnails of the candidates, and accept a selection from a keyboard or mouse indicating which is the desired document.

FIG. 2 shows page processor 40 in greater detail. Page processor 40 accepts a digital representation of a page 100 as its input and outputs an icon 102 as the iconic representation of page 100. In the figure, page 100 is shown with a text region 104 and a graphics region 106, and page processor 40 is shown with a segmentation analyzer 108, a text reducer 110, a graphics reducer 112 and a page reassembler 114. Segmentation analyzer 108 produces, from the input page 100, a map 116 of the different regions of page 100, which in this case contains just one text region and one graphics region. Of course, a typical document might contain more complex pages with more varied regions.

Page 100 and map 116 are provided to text reducer 110 and graphics reducer 112. Alternatively, to save transmission time or storage space, page 100 could be separated beforehand into subpages for each type of region found. Either way, the particular reducer operates only on its region type. While only two reducers are shown, other reducers might also be used. For example, if segmentation analyzer 108 detected a region of glyphs (machine readable marks) or bar codes, a glyph or bar code reducer would be used. That reducer would simply read the information encoded in the glyphs or bar code and generate machine readable marks encoding that information in less area.

Once each of the regions is reduced, they are recombined by page reassembler 114 to form icon 102. One general method of page segmentation is shown by Cullen, J. F., and Ejiri, K., “Weak Model-Dependent Page Segmentation and Skew Correction for Processing Document Images”, Proc. of the 2nd Internat. Conf. on Doc. Anal. and Recog. 757-60 (1993).

The particular method of reduction is such that a small icon is recognizable to the user, although not necessarily readable, as well as being distinguishable by analysis engine 52 when the icon is used to request a document.

For example, text reducer 110 does not merely reduce the text region. To make the document more differentiable to analysis engine 52, each character in a text region is replaced by a block font character. FIG. 3 shows an example of a page 300 to be iconized. FIG. 4 shows an icon 400 formed from page 300 (icon 400(A) is the icon shown full size; icon 400(B) is the icon shown at a size found in a typical guide page). In FIG. 4, each character is replaced with a block character. While this renders the text unreadable, it does not need to be readable to be recognizable to the user. It also does not need to be readable to analysis engine 52 if the actual characters are not used as features. In Hull, for example, the characters are not used, but the pattern of word lengths in the text is. By replacing characters with blocks as with icon 400, the word lengths are more likely to be preserved through subsequent copying or facsimile transmission of icon 400. The blocks can be generated in several ways. One way, suitable for use with structured documents such as a word processing file, where the image is created by displaying a font character for each representation of a character, is to use a font of blocks. For example, a structured document might have the ASCII code ‘65’ stored therein. A display driver would use that code as an index into a font table and retrieve the character image “A” for display. To generate blocks, the font table could be replaced with character images which are all blocks, except for the space character, of course (and possibly other punctuation). If the page 100 is not represented as a structured document, but is just an image of the page (e.g., a bit map), each character can be bounded by a bounding box and the bounding box filled in. This eliminates the need for an intermediate character recognition step and its accompanying errors.
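Both ways of generating blocks can be sketched briefly. The first function plays the role of a font table whose glyphs are all blocks except for whitespace. The second fills character bounding boxes in a bitmap without any character recognition; as a simplifying assumption it handles a single line of text and treats each run of inked columns as one character. Function names and the 0/1 bitmap representation are hypothetical.

```python
# Hypothetical sketch of block generation; not the patent's implementation.

BLOCK = "\u2588"  # full-block glyph standing in for a block font character


def to_block_font(text):
    """Structured text: swap every non-space character for a block."""
    return "".join(ch if ch.isspace() else BLOCK for ch in text)


def fill_character_boxes(bitmap):
    """Bitmap text line: fill the bounding box of each run of inked columns.

    `bitmap` is a list of rows of 0/1 pixels holding ONE line of text;
    characters are assumed to be separated by at least one blank column.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if bitmap else 0
    out = [[0] * width for _ in range(height)]

    def column_has_ink(x):
        return any(bitmap[y][x] for y in range(height))

    x = 0
    while x < width:
        if not column_has_ink(x):
            x += 1
            continue
        start = x
        while x < width and column_has_ink(x):
            x += 1
        # Rows touched by this character give the box's top and bottom.
        rows = [y for y in range(height) if any(bitmap[y][start:x])]
        for y in range(rows[0], rows[-1] + 1):
            for cx in range(start, x):
                out[y][cx] = 1
    return out


blocked_text = to_block_font("An icon")
blocked_bitmap = fill_character_boxes([
    [0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
])
```

In both cases the spaces between words survive unchanged, so the word-length pattern of the original text is still present in the blocked output.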

For even greater reproducibility, the words can be replaced with lines. To do this, the bounding boxes for the characters and interword spaces are determined. The bounding boxes are then spaced evenly and replaced with a line segment. Thus each line of text is replaced with collinear line segments each with a length proportional to the number of characters in the word being replaced.
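The replacement of words by collinear segments can be sketched as below, together with the inverse step an analysis engine might use to read word lengths back. The unit length per character, the gap width, and the function names are assumptions chosen for illustration.

```python
# Hypothetical sketch: one text line -> collinear segments, and back.

def words_to_segments(line, unit=2, gap=2):
    """Return (start, end) x-extents, one segment per word.

    Each segment is `unit` long per character; `gap` separates words.
    """
    segments = []
    x = 0
    for word in line.split():
        length = len(word) * unit
        segments.append((x, x + length))
        x += length + gap
    return segments


def segments_to_word_lengths(segments, unit=2):
    """Recover the number of characters per word from segment lengths."""
    return [(end - start) // unit for start, end in segments]


segments = words_to_segments("ab cde", unit=2, gap=1)
lengths = segments_to_word_lengths(segments, unit=2)
```

Since only the segment lengths matter, small amounts of noise along a segment's edges do not change the recovered word length as long as the length stays closest to the right multiple of the unit.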

As a refinement to the process of reduction, segmentation analyzer 108 might separately categorize large font text and small font text. If this is done, the large font text would be processed by a reducer which reduces the text proportionally, with or without character recognition, so the user can still read the large font text when it is reduced. The small font text would be processed by a reducer as explained above to replace characters with blocks or lines.

With multi-color documents, the color might be preserved in the reduction of the example page to an icon.

Another refinement is to place each block character along the text baseline, and to provide a fixed spacing between each block within each word. This can aid the image processing feature detection of the character block pattern.

In one embodiment, if line art is detected, such as in graphics region 106, graphics reducer 112 processes the line art differently than photographic art. Line art is graphics which are relatively well defined and do not use shades of grey. Line art is reduced according to a structure preserving operation such as line thinning to further enhance its identifiability.

FIG. 5 is an illustration of the relative size of icons. FIG. 5 shows a guide page 500 containing icons similar to icon 102 and space for 49 icons per guide page (seven rows of seven icons each; 98 icons if double-sided) although icons might be made even smaller. With only ten such double-sided guide pages, a user can scan icons for example pages for nearly 1000 documents, which might total several tens of thousands of pages stored by document server 10. The user need not even maintain the guide pages, if the document server can print out the guide pages on demand. The icons could even be stored with the documents in document database 36 or with the key data in document index table 38.
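The capacity figures above follow from simple arithmetic, and the same grid gives a way to turn a sequential icon number into a guide page location. The sketch below assumes zero-based numbering and a fill order of page, then side, then row, then column; that ordering and the function names are illustrative assumptions.

```python
# Hypothetical sketch of guide page capacity and icon placement.

ROWS = 7
COLUMNS = 7
SIDES = 2  # double-sided guide pages


def capacity(guide_pages):
    """Total icons that fit on the given number of double-sided pages."""
    return guide_pages * SIDES * ROWS * COLUMNS


def locate(icon_index):
    """Map a zero-based icon index to (page, side, row, column)."""
    page, rest = divmod(icon_index, SIDES * ROWS * COLUMNS)
    side, rest = divmod(rest, ROWS * COLUMNS)
    row, column = divmod(rest, COLUMNS)
    return page, side, row, column


ten_pages = capacity(10)   # 10 * 2 * 7 * 7
last_on_first_page = locate(97)
first_on_second_page = locate(98)
```

Ten double-sided pages therefore hold 980 icons, which is the "nearly 1000 documents" of the paragraph above when each document has one icon.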

To retrieve a copy of a document containing page 100, from which icon 102 is derived, a user might just circle icon 102 on guide page 500 and submit the guide page to document server 10. Guide page 500 could also be used for document management. For example, document server 10 might be programmed to accept a guide page with icons “X”ed out to signal that the corresponding documents should be deleted from document database 36. A document server might try to automatically distinguish guide pages from other pages by doing a test retrieval of a document from what appears to be an icon. If a document is retrieved in such a manner, the page is assumed to be a guide page.

FIG. 6 is a flow chart of a process for storing documents in a document database according to the present invention. The process begins when a user presents documents to a document server. In step S1, a document is scanned if it is not already in electronic form. In step S2, the document is stored in a document database and keys are extracted from the document, if used. As explained above, one method of extracting redundant features for use as keys is taught by Hull. Next, an example page is selected from the document to be used for icon generation (S3). If the selection of an example page is automatic, the document server might always select the first page of the document, examine the pages of the document to locate rarely found features, such as graphs in a database of mostly text documents, or choose to have all pages selected. Otherwise, the selection of a memorable example page can be made by the user.

Once the example page is selected, the example page is segmented to form a map, or layout, of the regions of the example page (S4). Each of these regions is reduced according to reduction processes specific to the image type of the region (S5), and the reduced regions are then reassembled into an electronic representation of the icon (S6). If an icon ID is used, it is added to the electronic representation (S7).

The electronic icon is added to the other electronic icons associated with a guide page for the icons (S8) and a guide page with the icon is printed as needed (S9). A guide page is typically not printed after each icon, but is printed when the process of document entry is complete or when a guide page is full.

Once the icon is either printed or stored with other icons for later printing, the document server checks for more documents (S10). If more documents are to be processed, the process continues back at step S1, otherwise the process of document storage ends.

FIG. 7 is a flow chart of the process for retrieval of documents stored according to the process shown in FIG. 6. The retrieval process begins when a user presents an icon to the document server where the icon represents an example page of a target document to be retrieved and the icon is scanned (step R1). Next, in step R2, the document server determines whether an icon identifier is available from the scanned image of the icon (either an icon specific identifier or a guide page identifier and an icon location on the guide page). If the application allows for different guide pages to be used on different systems, the document server might also check to see if the icon identifier is valid for the system on which it is used. The document server might also use the content of the icon itself as a cross reference to verify that the icon identifier is correct.

If the icon identifier is not present or used, the document server analyzes the icon contents as described above to extract features used by the search engine to search for matching documents (R3). If the icon identifier is used, the icon identifier is extracted and provided to the search engine (R4). In either case, the search engine searches for the target document (R5), and checks to see if more than one matching document was found (R6). If more than one document was found, the user is presented with indications of the matching documents and asked to select the target document from among them (R7). Once a single document is selected, it is returned as the target document (R8).
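The branching in steps R2 through R8 can be sketched as a single function. The scanned icon is modeled here as a dictionary with optional "icon_id" and "features" entries, and the two lookups and the user's selection are passed in as callables; all of these names and the data shapes are hypothetical, and the case where no document matches is handled by returning None, which the flow chart description does not address.

```python
# Hypothetical sketch of the retrieval flow (steps R2 to R8).

def retrieve(scanned_icon, lookup_by_icon_id, lookup_by_content, choose):
    """Return the target document pointer, or None if nothing matches."""
    icon_id = scanned_icon.get("icon_id")                 # R2: identifier present?
    if icon_id is not None:
        candidates = lookup_by_icon_id(icon_id)           # R4, R5
    else:
        candidates = lookup_by_content(scanned_icon["features"])  # R3, R5
    if not candidates:
        return None  # assumption: no-match handling is not specified
    if len(candidates) > 1:                               # R6
        return choose(candidates)                         # R7: user selects
    return candidates[0]                                  # R8


# Stand-in index tables and a stand-in for the interactive selection.
id_table = {101: ["docA"]}
content_table = {(3, 5, 2): ["docA", "docB"]}


def by_id(icon_id):
    return id_table.get(icon_id, [])


def by_content(features):
    return content_table.get(tuple(features), [])


def pick_first(candidates):
    return candidates[0]


via_id = retrieve({"icon_id": 101}, by_id, by_content, pick_first)
via_content = retrieve({"features": [3, 5, 2]}, by_id, by_content, pick_first)
```

Passing the selection step in as a callable keeps the flow the same whether the choice is made at a keyboard, with a mouse, or on a printed page of thumbnails.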

In this way, users can easily store and extract documents from a document server using just a few guide pages of icons. In view of the above description of the document server, several applications and uses are suggested. For example, a user might provide documents to a digital copier/scanner which is part of a document server system. The documents could then be scanned, the original pages of the documents erased and recycled, and the user supplied with a guide page containing icons for the documents (not necessarily in a one-to-one relationship).

Although it is not necessarily the preferred embodiment, the icons could be electronically stored in the document server and a guide page could then be printed out on request. If the icons are electronically stored, then it is a simple matter to print out updated guide pages as new icons are added. However, the advantage of having a portable guide page is lost where the user relies solely on the document server to print out a guide page each time a document is to be retrieved. The document server might also provide a guide page update facility, where a user submits a guide page, which is scanned and recycled and a new guide page is printed.

When the user desires to retrieve or delete a document, the user circles (for retrieval) or crosses out (for deletion) the appropriate icons on the guide page, possibly with a pen having machine detectable ink. Alternatively, a small hand-held scanner could be used to scan single items. The document server then locates the relevant documents and takes the appropriate action, either deleting them or presenting them to the user. Instead of retrieving entire documents, of course, the user could indicate the specific pages desired.

With a stable set of icons on a guide page, a user would become more comfortable and familiar with the layout and location of the icons, thus leading to a user being able to quickly locate a document by remembering the location of the icon on the guide page and immediately identifying it.

Icons can also be used as a paper interface to eliminate the need for other types of data entry. For example, an iconic guide page might include icons for a list of people, each showing their name and picture. To use the guide page, the user would circle one of the images and document server 10 would return a set of information associated with that icon. In a specific application, the guide page shows all the individuals in a work group and the guide page with an icon circled is provided to document server 10 to indicate who the document is to be routed to. Thus, the document server would identify the user from a list of users using either an icon identifier or the contents of the icon, look up a network or electronic mail address for the destination user and send the document to that destination.

The above description is illustrative and not restrictive. Many variations of the invention will become apparent to those of skill in the art upon review of this disclosure. The scope of the invention should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims along with their full scope of equivalents.

What is claimed is:
 1. A method of routing a document electronically, comprising the steps of: generating a plurality of destination icons to form a guide page, wherein a destination icon is a representation of a specific destination; selecting at least one destination icon from said guide page; marking said selected destination icon; extracting at least one feature of said selected destination icon; using said extracted feature to electronically identify at least one electronic address of said specific destination corresponding to said extracted feature; and electronically routing said document to said electronic address.
 2. The method of claim 1, wherein said specific destinations correspond to specific people.
 3. The method of claim 1, wherein said specific destinations correspond to specific groups of people.
 4. The method of claim 1, wherein said destination icons are comprised of destination symbols.
 5. The method of claim 1, wherein said destination icons are comprised of pictures and names.
 6. The method of claim 1, further comprising the step of entering said selected destination icon into a document server system prior to said extraction step.
 7. The method of claim 6, wherein said entering step further comprises the step of scanning said selected destination icon.